By the Pareto principle, roughly 80% of sales come from 20% of customers. Usually your business is supported largely by a small fraction of your customer base - your best customers.
From a marketing perspective, it is important to understand the characteristics and preferences of your best customers, so that you can identify, early on, the customers who are likely to stay loyal.
In this notebook, we perform customer segmentation using RFM clustering to identify those best customers.
The idea behind RFM:
R - Recency: customers who have purchased from your store recently are more likely to convert again than those who haven't visited for a while
F - Frequency: customers who buy from your store often are more likely to buy again than those who buy infrequently
M - Monetary: customers who spend more are more likely to buy again than those who spend less
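As a concrete illustration, all three metrics can be derived from a plain transaction log. A minimal sketch on made-up toy data (the column names here are hypothetical, not the dataset's):

```python
import pandas as pd

# Toy transaction log: one row per purchase line (made-up data)
tx = pd.DataFrame({
    'CustomerID': [1, 1, 2, 2, 2, 3],
    'InvoiceDate': pd.to_datetime(['2011-01-05', '2011-03-01',
                                   '2011-02-10', '2011-02-20', '2011-03-05',
                                   '2010-12-15']),
    'Amount': [10.0, 25.0, 5.0, 8.0, 12.0, 100.0],
})

snapshot = tx['InvoiceDate'].max()  # observation point
rfm = tx.groupby('CustomerID').agg(
    Recency=('InvoiceDate', lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=('InvoiceDate', 'count'),                            # number of purchase lines
    Monetary=('Amount', 'sum'),                                    # total spend
).reset_index()
print(rfm)
```

The rest of the notebook builds exactly this kind of per-customer table, one metric at a time.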
Load the dataset and perform data cleaning
K-means Clustering to create the Recency, Frequency and Monetary scores for each customer
Find the optimal number of clusters for each attribute. We aim to use the same number of clusters for each attribute, and a higher cluster number (score) should reflect a better result, i.e. cluster 3 is better than cluster 1.
- Recency: number of days since most recent purchase date (More #days would have lower cluster#)
- Frequency: number of transactions within the period (higher freq. would have higher cluster#)
- Monetary: total sales attributed to the customer (higher sales would have higher cluster#)
Compare the RFM clusters (observe how they relate to each other) and calculate a total score
- Champions: our best customers!
- Loyalists
- New Customers
- Can’t Lose Them
- Slipping
This Online Retail II dataset contains all transactions occurring for a UK-based, registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many of its customers are wholesalers.
Data Source: https://www.kaggle.com/mathchi/online-retail-ii-data-set-from-ml-repository
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import plotly.express as px
plt.style.use('seaborn')
import warnings
warnings.filterwarnings("ignore")
df = pd.read_excel('online_retail_II.xlsx')
# create a copy of original dataset
df_org = df.copy()
df.head()
| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 2009-12-01 07:45:00 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 2009-12-01 07:45:00 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 2009-12-01 07:45:00 | 2.10 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 2009-12-01 07:45:00 | 1.25 | 13085.0 | United Kingdom |
df.describe()
| | Quantity | Price | Customer ID |
|---|---|---|---|
| count | 525461.000000 | 525461.000000 | 417534.000000 |
| mean | 10.337667 | 4.688834 | 15360.645478 |
| std | 107.424110 | 146.126914 | 1680.811316 |
| min | -9600.000000 | -53594.360000 | 12346.000000 |
| 25% | 1.000000 | 1.250000 | 13983.000000 |
| 50% | 3.000000 | 2.100000 | 15311.000000 |
| 75% | 10.000000 | 4.210000 | 16799.000000 |
| max | 19152.000000 | 25111.090000 | 18287.000000 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   Invoice      525461 non-null  object
 1   StockCode    525461 non-null  object
 2   Description  522533 non-null  object
 3   Quantity     525461 non-null  int64
 4   InvoiceDate  525461 non-null  datetime64[ns]
 5   Price        525461 non-null  float64
 6   Customer ID  417534 non-null  float64
 7   Country      525461 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 32.1+ MB
df.shape
(525461, 8)
df = df_org.copy()
df.isnull().sum()/df.shape[0]
Invoice        0.000000
StockCode      0.000000
Description    0.005572
Quantity       0.000000
InvoiceDate    0.000000
Price          0.000000
Customer ID    0.205395
Country        0.000000
dtype: float64
# remove data with missing customer ID
df.dropna(subset=['Customer ID'],inplace=True)
df.isnull().sum()/df.shape[0]
Invoice        0.0
StockCode      0.0
Description    0.0
Quantity       0.0
InvoiceDate    0.0
Price          0.0
Customer ID    0.0
Country        0.0
dtype: float64
df = df.reset_index(drop=True)
df.shape
(417534, 8)
Recency
We need to find out the most recent purchase date of each customer and check out how many days since that purchase. Then, we can apply K-means clustering to assign each customer a recency score.
#find the most recent purchase date for each customer
df_user = df.groupby('Customer ID')['InvoiceDate'].max().reset_index()
df_user.columns = ['CustomerID','MaxPurchaseDate']
#set the most recent purchase date in the dateset as our observation point
#recency = our observation point - most recent purchase date of each user
df_user['Recency'] = df_user['MaxPurchaseDate'].max() - df_user['MaxPurchaseDate']
#Leave only the customerID and recency columns
df_user = df_user.drop('MaxPurchaseDate',axis=1)
#convert Recency to integer
df_user['Recency'] = pd.to_timedelta(df_user.Recency, errors='coerce').dt.days
df_user.head(2)
| | CustomerID | Recency |
|---|---|---|
| 0 | 12346.0 | 66 |
| 1 | 12347.0 | 2 |
df_user.Recency.describe()
count    4383.000000
mean       90.395163
std        98.176493
min         0.000000
25%        16.000000
50%        51.000000
75%       137.000000
max       373.000000
Name: Recency, dtype: float64
Average Recency is 90 days and median Recency is 51 days.
#plot a histogram to show the recency distribution
fig = px.histogram(df_user, x="Recency",nbins=30,labels={'Recency':'Recency(#days)'},title='Recency distribution')
fig.show()
Most customers are recent shoppers, with recency under 100 days.
K-means Clustering to assign recency score
The goal of K-means clustering is to group data points into distinct, non-overlapping subgroups. Mathematically, K-means minimizes the sum of squared distances (SSE) between each cluster centroid and its associated data points.
K-means clustering can handle multiple variables at once, but in this exercise I focus on one variable (R, F or M) at a time.
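To make the objective concrete, here is a small sketch: fit K-means on toy 1-D data and recompute the SSE by hand, which should match what scikit-learn exposes as `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

# 1-D toy data with two obvious groups
x = np.array([[1.], [2.], [3.], [10.], [11.], [12.]])
km = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=0).fit(x)

# SSE = sum of squared distances from each point to its assigned centroid
centroids = km.cluster_centers_[km.labels_]
sse_manual = ((x - centroids) ** 2).sum()
print(sse_manual, km.inertia_)  # the two values should match
```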
from sklearn.cluster import KMeans
#to store sum of squares values
sse = {}
for k in range(1,10):
kmeans = KMeans(n_clusters=k,max_iter=1000,init='k-means++').fit(df_user[['Recency']])
#df_user['cluster'] = kmeans.labels_
sse[k] = kmeans.inertia_
#The elbow method
plt.figure()
plt.plot(list(sse.keys()),list(sse.values()))
plt.title("The Elbow Method")
plt.ylabel('sum of square of distances')
plt.xlabel('# clusters')
plt.axvspan(3, 4, facecolor='yellow', alpha=0.3)
plt.show()
The Elbow Method
The SSE (sum of squared distances) decreases as the number of clusters increases; we aim for the point after which the reduction in SSE becomes marginal. In this case, 3 clusters would be that point, or we can choose 4, which gives an even lower SSE.
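The elbow can also be picked numerically rather than by eye. A simple heuristic (the 10% threshold is an arbitrary assumption for illustration): stop at the first k after which the relative SSE drop becomes small:

```python
# Pick the smallest k after which the relative drop in SSE
# falls below a threshold (10% here, chosen arbitrarily).
def pick_elbow(sse_by_k, min_rel_drop=0.10):
    ks = sorted(sse_by_k)
    for prev, curr in zip(ks, ks[1:]):
        rel_drop = (sse_by_k[prev] - sse_by_k[curr]) / sse_by_k[prev]
        if rel_drop < min_rel_drop:
            return prev  # adding more clusters no longer helps much
    return ks[-1]

# Made-up SSE values shaped like a typical elbow curve
sse_example = {1: 1000.0, 2: 400.0, 3: 150.0, 4: 120.0, 5: 110.0}
print(pick_elbow(sse_example))
```

In the notebook this could be applied directly to the `sse` dict built above; the visual elbow plot remains the more common sanity check.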
#set no of clusters = 4
kmeans = KMeans(n_clusters=4,init='k-means++')
kmeans.fit(df_user[['Recency']])
#group the customer into different clusters
df_user['RecencyCluster'] = kmeans.predict(df_user[['Recency']])
#df_user.groupby('RecencyCluster')['Recency'].describe()
#function to reorder cluster numbers so that a higher cluster number corresponds to a better score
def cluster_reorder(df,cluster_field,variable,boolean_asc):
df_mean = df.groupby(cluster_field)[variable].mean().sort_values(ascending=boolean_asc).reset_index()
df_mean['new_cluster'] = df_mean.index
df_mean.drop(variable,inplace=True,axis=1)
df_new = pd.merge(df,df_mean,how='left',on=cluster_field)
df_new.drop(cluster_field,inplace=True,axis=1)
df_new.rename(columns={'new_cluster':cluster_field},inplace=True)
return df_new
#the higher the cluster number, the more recent the visitor
df_user = cluster_reorder(df_user,'RecencyCluster','Recency',False)
df_user.groupby('RecencyCluster')['Recency'].describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| RecencyCluster | ||||||||
| 0 | 473.0 | 307.221987 | 40.481993 | 252.0 | 269.0 | 302.0 | 357.0 | 373.0 |
| 1 | 661.0 | 188.461422 | 33.270204 | 133.0 | 161.0 | 186.0 | 217.0 | 247.0 |
| 2 | 1149.0 | 76.841601 | 22.529407 | 48.0 | 58.0 | 71.0 | 92.0 | 132.0 |
| 3 | 2100.0 | 18.105714 | 13.278964 | 0.0 | 7.0 | 16.0 | 28.0 | 46.0 |
Cluster 3 is the best with most recent visits, and cluster 0 is the worst.
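To sanity-check the reordering logic, here is a standalone toy run of `cluster_reorder` (the function body is repeated, without the in-place operations, so the snippet runs on its own):

```python
import pandas as pd

# Standalone copy of the notebook's cluster_reorder
def cluster_reorder(df, cluster_field, variable, boolean_asc):
    df_mean = (df.groupby(cluster_field)[variable].mean()
                 .sort_values(ascending=boolean_asc).reset_index())
    df_mean['new_cluster'] = df_mean.index
    df_mean = df_mean.drop(variable, axis=1)
    df_new = pd.merge(df, df_mean, how='left', on=cluster_field)
    df_new = df_new.drop(cluster_field, axis=1)
    return df_new.rename(columns={'new_cluster': cluster_field})

# Toy data: raw KMeans labels carry no order. After reordering with
# ascending=False, lower Recency (better) maps to a higher cluster number.
toy = pd.DataFrame({'Recency': [300, 310, 50, 60, 5],
                    'RecencyCluster': [2, 2, 0, 0, 1]})
toy = cluster_reorder(toy, 'RecencyCluster', 'Recency', False)
print(toy)
```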
Frequency
We need to find the number of purchases for each customer, which is the frequency. Then, we can apply K-means clustering to assign each customer a frequency score.
#get no of purchases for each customer (this counts invoice line items; use nunique() instead to count distinct invoices)
df_freq = df.groupby('Customer ID')['Invoice'].count().reset_index()
df_freq.columns=['CustomerID','Frequency']
#merge the frequency col to our existing dataframe containing Recency scores
df_user = df_user.merge(df_freq,on='CustomerID')
df_user.head()
| | CustomerID | Recency | RecencyCluster | Frequency |
|---|---|---|---|---|
| 0 | 12346.0 | 66 | 2 | 46 |
| 1 | 12347.0 | 2 | 3 | 71 |
| 2 | 12348.0 | 73 | 2 | 20 |
| 3 | 12349.0 | 42 | 3 | 107 |
| 4 | 12351.0 | 10 | 3 | 21 |
df_user.Frequency.describe()
count    4383.000000
mean       95.262149
std       204.903882
min         1.000000
25%        18.000000
50%        44.000000
75%       103.000000
max      5710.000000
Name: Frequency, dtype: float64
Average Frequency is 95 transactions per customer and median Frequency is 44 transactions.
#plot a histogram to show the frequency distribution
fig = px.histogram(df_user, x="Frequency",labels={'Frequency':'Frequency (#transactions)'},
title='Frequency distribution')
fig.show()
Frequency drops off sharply beyond roughly 100 transactions.
K-means Clustering to assign frequency score
Similarly, apply 4 clusters for K-means clustering and predict the cluster number for each customer.
kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['Frequency']])
df_user['FreqCluster'] = kmeans.predict(df_user[['Frequency']])
#re-order the cluster#: the higher the cluster number, the more frequent the visitor
df_user = cluster_reorder(df_user,'FreqCluster','Frequency',True)
df_user.groupby('FreqCluster')['Frequency'].describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| FreqCluster | ||||||||
| 0 | 3897.0 | 52.762381 | 48.222768 | 1.0 | 16.00 | 37.0 | 77.00 | 203.0 |
| 1 | 462.0 | 353.277056 | 155.084614 | 204.0 | 239.25 | 295.5 | 417.75 | 973.0 |
| 2 | 21.0 | 1616.857143 | 548.766005 | 1058.0 | 1213.00 | 1377.0 | 1826.00 | 2710.0 |
| 3 | 3.0 | 4917.000000 | 907.677806 | 3927.0 | 4520.50 | 5114.0 | 5412.00 | 5710.0 |
Similarly, cluster 3 is the best with most frequent visits, and cluster 0 is the worst.
Monetary
We will calculate the total sales generated by each customer. Then, we will apply K-means clustering to assign each customer a Monetary score.
#create a col for TotalSales
df['TotalSales'] = df.Quantity*df.Price
#get Total sales amount for each customer
df_money = df.groupby('Customer ID')['TotalSales'].sum().reset_index()
df_money.columns=['CustomerID','Monetary']
#merge the monetary col to our existing dataframe containing Recency and Frequency scores
df_user = df_user.merge(df_money,on='CustomerID')
df_user.head()
| | CustomerID | Recency | RecencyCluster | Frequency | FreqCluster | Monetary |
|---|---|---|---|---|---|---|
| 0 | 12346.0 | 66 | 2 | 46 | 0 | -64.68 |
| 1 | 12347.0 | 2 | 3 | 71 | 0 | 1323.32 |
| 2 | 12348.0 | 73 | 2 | 20 | 0 | 222.16 |
| 3 | 12349.0 | 42 | 3 | 107 | 0 | 2646.99 |
| 4 | 12351.0 | 10 | 3 | 21 | 0 | 300.93 |
df_user.Monetary.describe()
count      4383.000000
mean       1904.679118
std        8519.369281
min      -25111.090000
25%         285.260000
50%         655.940000
75%        1645.690000
max      341776.730000
Name: Monetary, dtype: float64
Average monetary value is £1905 per customer and the median is £656. Some customers even have negative totals, caused by returns/cancellations.
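In Online Retail II, cancellations are flagged by an invoice code starting with 'C' (per the dataset description) and carry negative quantities, which is what produces negative customer totals. This notebook keeps them so that returns net against purchases, but if you wanted to exclude them, a sketch on toy rows:

```python
import pandas as pd

# Toy rows mimicking the dataset's convention: an Invoice code
# starting with 'C' marks a cancellation with negative quantity
df_toy = pd.DataFrame({
    'Invoice': ['489434', 'C489449', '489500'],
    'Quantity': [12, -12, 3],
    'Price': [6.95, 6.95, 2.10],
})
cancelled = df_toy['Invoice'].astype(str).str.startswith('C')
df_kept = df_toy[~cancelled].copy()  # drop cancellation rows
print(df_kept['Invoice'].tolist())
```

Whether to drop cancellations or let them offset the original purchases is a modelling choice; it changes the Monetary (and Frequency) values downstream.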
#plot a histogram to show the Monetary distribution
fig = px.histogram(df_user, x="Monetary",labels={'Monetary':'Transaction value (in sterling (£))'},
title='Transaction value distribution')
fig.show()
Most customers spend less than £1,000 in total.
K-means Clustering to assign Monetary score
Similarly, apply 4 clusters for K-means clustering and predict the cluster number for each customer.
kmeans = KMeans(n_clusters=4)
kmeans.fit(df_user[['Monetary']])
df_user['MonetaryCluster'] = kmeans.predict(df_user[['Monetary']])
#re-order the cluster#: the higher the cluster number, the higher the spender
df_user = cluster_reorder(df_user,'MonetaryCluster','Monetary',True)
df_user.groupby('MonetaryCluster')['Monetary'].describe()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| MonetaryCluster | ||||||||
| 0 | 4310.0 | 1268.350685 | 1817.011407 | -25111.09 | 280.1525 | 638.97 | 1553.6125 | 12744.37 |
| 1 | 65.0 | 24698.845431 | 11480.738592 | 13189.91 | 16808.0300 | 19956.55 | 29175.4100 | 55828.31 |
| 2 | 6.0 | 115093.731667 | 42289.451857 | 75610.17 | 82562.2050 | 105966.51 | 135300.0375 | 183180.55 |
| 3 | 2.0 | 292814.890000 | 69242.498167 | 243853.05 | 268333.9700 | 292814.89 | 317295.8100 | 341776.73 |
cluster_count = pd.melt(df_user,id_vars=['CustomerID'],value_vars=['RecencyCluster','FreqCluster','MonetaryCluster'])
cluster_count.columns=['CustomerID','cluster','score']
cluster_count = cluster_count.groupby(['cluster','score'])['CustomerID'].count().reset_index()
cluster_count.columns=['cluster','score','count']
cluster_count
| | cluster | score | count |
|---|---|---|---|
| 0 | FreqCluster | 0 | 3897 |
| 1 | FreqCluster | 1 | 462 |
| 2 | FreqCluster | 2 | 21 |
| 3 | FreqCluster | 3 | 3 |
| 4 | MonetaryCluster | 0 | 4310 |
| 5 | MonetaryCluster | 1 | 65 |
| 6 | MonetaryCluster | 2 | 6 |
| 7 | MonetaryCluster | 3 | 2 |
| 8 | RecencyCluster | 0 | 473 |
| 9 | RecencyCluster | 1 | 661 |
| 10 | RecencyCluster | 2 | 1149 |
| 11 | RecencyCluster | 3 | 2100 |
fig = px.bar(cluster_count, x='cluster',y='count',color='score',title='RFM Score Counts')
fig.show()
The Frequency and Monetary clusters are dominated by score 0, while the Recency cluster scores are more evenly distributed.
g = sns.relplot(data=df_user,x='RecencyCluster',y='MonetaryCluster',col='FreqCluster',kind="scatter",height=3.5)
g.fig.subplots_adjust(top=0.7) # adjust the Figure in rp
g.fig.suptitle('Recency vs Monetary per Frequency cluster')
Overall RFM Score
We have created Recency, Frequency and Monetary scores for each customer; we can sum them up to give a total score. The higher the score, the more valuable the customer.
#sum up all individual RFM clusters to get a total score
df_user['TotalScore'] = df_user['RecencyCluster'] + df_user['FreqCluster'] + df_user['MonetaryCluster']
#dive into total score cluster and see the corresponding RFM details
total_score = df_user.groupby('TotalScore')[['Recency','Frequency','Monetary']].median().round()
total_score_count = df_user.TotalScore.value_counts().reset_index()
total_score_count.columns=['TotalScore','count']
# a summary of RFM median values for each total score level
print('Median RFM values & Size for each total score group:')
total_score.merge(total_score_count,on='TotalScore')
Median RFM values & Size for each total score group:
| | TotalScore | Recency | Frequency | Monetary | count |
|---|---|---|---|---|---|
| 0 | 0 | 302.0 | 15.0 | 215.0 | 471 |
| 1 | 1 | 186.0 | 23.0 | 325.0 | 653 |
| 2 | 2 | 71.0 | 39.0 | 623.0 | 1094 |
| 3 | 3 | 18.0 | 59.0 | 888.0 | 1736 |
| 4 | 4 | 11.0 | 296.0 | 4001.0 | 370 |
| 5 | 5 | 7.0 | 446.0 | 17508.0 | 37 |
| 6 | 6 | 3.0 | 1302.0 | 30364.0 | 15 |
| 7 | 7 | 1.0 | 2710.0 | 83761.0 | 5 |
| 8 | 8 | 4.0 | 3758.0 | 190764.0 | 2 |
plt.figure()
df_user['TotalScore'].plot(kind='hist',alpha=0.6)
plt.xlabel("Total Score")
plt.ylabel("Count")
plt.title("Total Score distribution")
plt.axvline(x=2,color='purple', linestyle='--',alpha=0.5)
plt.axvline(x=3,color='purple', linestyle='--',alpha=0.5)
plt.axvline(x=5,color='purple', linestyle='--',alpha=0.5)
plt.show()
As the graphs above show, the Recency, Frequency and Monetary value distributions for this dataset are right-skewed: most customers visit infrequently and spend little, though quite a lot are recent visitors. Most customers fall into the low Frequency and Monetary clusters, while the Recency clusters are more evenly spread, and the total score of most customers is no more than 3.
With reference of above summary, we can group the customers into 5 segments based on the individual RFM scores and total score:
Champions (R>0,total score>=5) are your best customers, who bought most recently, most often, and are heavy spenders. Reward these customers. They can become early adopters for new products and will help promote your brand.
Loyalists (R>0, F>=1, 3<=total score<5) are recent and frequent visitors who spent a good amount. Offer membership or loyalty programs, or recommend related products to upsell them and help them become Champions.
New Customers (R=3,F=0) are your customers who visited recently but are not frequent shoppers. Start building relationships with these customers by providing onboarding support and special offers to increase their visits.
Can’t Lose Them (R<3,total score>=2) are customers who used to visit and purchase quite often, but haven’t been visiting recently. Bring them back with relevant promotions, and run surveys to find out what went wrong and avoid losing them to a competitor.
Slipping are great past customers who haven't bought in a while. Apply retention strategies to bring them back into your business.
#create the segment column to group the customers; note the assignment order matters - later rules override earlier ones
df_user['Segment'] = 'Slipping'
df_user.loc[(df_user['RecencyCluster']>0)&(df_user['FreqCluster']>=1)&(df_user['TotalScore']>=3),'Segment'] = 'Loyalists'
df_user.loc[(df_user['RecencyCluster']>0)&(df_user['TotalScore']>=5),'Segment'] = 'Champions'
df_user.loc[(df_user['RecencyCluster']==3)&(df_user['FreqCluster']==0),'Segment'] = 'New Customers'
df_user.loc[(df_user['RecencyCluster']<3)&(df_user['TotalScore']>=2),'Segment'] = 'Can’t Lose Them'
#summary to show the RFM median values for each segment
seg = df_user.groupby('Segment')[['Recency','Frequency','Monetary']].median().round().sort_values(by='Recency')
seg_count = df_user.groupby('Segment')['CustomerID'].count()
seg = seg.merge(seg_count,on='Segment')
seg.rename(columns={'CustomerID':'count'},inplace=True)
seg
| | Recency | Frequency | Monetary | count |
|---|---|---|---|---|
| Segment | ||||
| Champions | 4 | 689 | 20306.0 | 57 |
| Loyalists | 11 | 300 | 3921.0 | 356 |
| New Customers | 17 | 57 | 854.0 | 1687 |
| Can’t Lose Them | 71 | 41 | 657.0 | 1159 |
| Slipping | 231 | 19 | 276.0 | 1124 |
There are only 57 customers in the Champions group, with median RFM values of Recency=4 days, Frequency=689 and Monetary=£20,306. Next come the Loyalists, potential Champions who visited quite recently with moderate spending. Then the New Customers are those whom we want to turn into loyal customers. The 'Can't Lose Them' and Slipping groups are drifting away; we need retention strategies to keep them.
g = sns.barplot(data=seg,x=seg.index,y='count',alpha=0.8)
g.set_title('Segment distribution')
Most customers fall in the New Customers segment.
fig,ax = plt.subplots(1,3,figsize=(16,4))
fig.subplots_adjust(wspace=.1,hspace=.1)
sns.barplot(data=seg,x='Recency',y=seg.index,ax=ax[0])
ax[0].set_title('Recency')
ax[0].set(ylabel=None)
ax[0].set(xlabel=None)
sns.barplot(data=seg,x='Frequency',y=seg.index,ax=ax[1])
ax[1].set_title('Frequency')
ax[1].set(xlabel=None)
ax[1].set(ylabel=None)
ax[1].set(yticklabels=[])
sns.barplot(data=seg,x='Monetary',y=seg.index,ax=ax[2])
ax[2].set_title('Total Transaction Amount')
ax[2].set(xlabel=None)
ax[2].set(ylabel=None)
ax[2].set(yticklabels=[])
Champions visited recently, most frequently and were high spenders. The slipping group had not visited the store for long, more than 200 days (median), and spent a little.
fig = px.scatter(df_user, x='Recency', y='Monetary', color='Segment',
size='Frequency', title='Customer Segmentation',
category_orders={'Segment':['Champions','Loyalists','New Customers','Can’t Lose Them','Slipping']})
fig.show()
Bubble size represents frequency; the Champions group stretches up the Monetary axis with near-zero Recency and large bubbles.
fig = px.scatter_3d(df_user, x='Recency', y='Frequency', z='Monetary',
color='Segment',title='Customer Segmentation',
category_orders={'Segment':['Champions','Loyalists','New Customers','Can’t Lose Them','Slipping']})
fig.show()
The 3-D plot shows the separation of the customer segments more clearly.
Once we have calculated the RFM scores, we can segment the customer pool and identify the best customers - the Champions segment. Then we can start to analyse the characteristics and purchasing behavior of this group to figure out why they find more value in our business than the customers who seldom visit and spend less.
This will help us sharpen the understanding of the target market and be more precise in communicating with actual and potential customers.